Abstract
In this chapter, we provide an overview of the design, data-collection, and data-analysis 
efforts for a digital learning and assessment environment for scientific inquiry / science 
practices called Inq-ITS (Inquiry Intelligent Tutoring System; www.inqits.org). We first
present a brief literature review on current science standards, learning sciences research 
on students’ difficulties with scientific inquiry practices, and modern assessment design 
frameworks. We then describe how we used pilot data from four case studies with hands- 
on inquiry tasks for middle school students to better understand these difficulties and 
design various components of the Inq-ITS system to support students’ inquiry
accordingly. Lastly, we describe how we used key computational techniques from 
knowledge-engineering and educational data mining to analyze data from students’ log 
files in this environment to (1) automatically score students’ inquiry skills, (2) provide 
teachers with fine-grained, rich, classroom-based formative assessment data on these
practices, and (3) react in real time to scaffold students as they engage in inquiry.


Key words: Digital assessment environment, scientific inquiry practice, Inq-ITS, 
intelligent tutoring system, educational data mining.

Despite the billions of dollars spent on education every year in the U.S., the
expenditures do not result in superior test results for American students (Hanushek, 
2005). Specifically, American students continue to underperform in science compared to 
other developed countries. For example, in 2013, the United States ranked 21st 
worldwide on a key educational survey called the Program for International Student 
Assessment (PISA; Organization for Economic Co-operation and Development, 2014). 
There are a few reasons contributing to the poor test scores of American students on 
international comparisons of science competency such as PISA. 

First, the current public school system was modeled on factories with a “one-size 
fits all” approach to teaching (Christensen, Horn, & Johnson, 2008). This approach does 
not recognize the various dimensions on which students differ, including but not limited 
to: prior content knowledge, skills to conduct science inquiry, epistemological 
understanding of science, and engagement and/or motivation for science learning, all of 
which influence school performance. 

Second, standardized tests, which typically use multiple-choice and
fill-in-the-blank items that tap rote science “facts” as a measure of science content
knowledge, are not measuring the knowledge and competencies proposed by national
frameworks such as the Next Generation Science Standards (NGSS; NGSS Lead States, 2013). Key
competencies required by the NGSS include, for example, asking questions, planning and 
carrying out experiments, analyzing and interpreting data, warranting claims with 
evidence, and communicating findings (Clarke-Midura, et al., 2011; deBoer et al., 2008; 
Haertel, Lash, Javitz, & Quellmalz, 2006; Quellmalz & Haertel, 2004; Quellmalz, 
Kreikmeier, DeBarger, & Haertel, 2007). 

The standardized, multiple choice tests developed years ago are no longer 
sufficient as a measure of 21st century skills and knowledge as described in the NGSS,
which include process skills for ‘doing science’ (i.e., science practices) as critical aspects 
of science literacy (Perkins, 1986) so that people can apply and transfer their knowledge 
in flexible ways (NGSS Lead States, 2013). That is, these tests do not provide basic 
information about higher-order thinking in science such as students’ processes and
reasoning (Leighton & Gierl, 2011). As discussed elsewhere (Gobert et al., 2013), the
limitations of these multiple choice tests are in part an artifact of a simplified
conceptualization of what constituted science understanding at the time these 
accountability tests were designed (DiCerbo & Behrens, 2012; Mislevy et al., 2012).

A third but related barrier to cultivating scientifically literate students is the lack 
of systems that can provide individualized, real time scaffolding of students’ science 
practices. From multiple choice tests designed for accountability purposes, educators 
cannot know who needs help and, as a result, any kind of feedback to students that is 
needed for deep learning from these tests is given too late to be formative - typically 
months after the school year has ended. As a result, many students struggle in silence as 
confirmed by prior literature and consistent data showing poor motivation for and 
disengagement from science learning (see, e.g., Gobert, Baker, & Wixon, 2015). 
Moreover, because science inquiry is an ill-defined task, there are myriad
ways in which students conduct inquiry, both when they are on the “right” track and
when they are not (Kuhn, 2005). Although progress on real time systems for well-defined 
domains like math and computer science has been made (cf., Koedinger & Corbett, 2006; 
Corbett & Anderson, 1995), there are few to no systems that adapt to individual learners 
as they conduct science inquiry. 

These assessment challenges and needs for adaptive instruction have led to the 
development of new technology-centered measurement paradigms for science education 
(see Timms, Clements, Gobert, Ketelhut, Lester, Reese, & Wiebe, 2012). At present, key 
organizations such as the National Center for Education Statistics, which is responsible for
the National Assessment of Educational Progress (NCES, 2011), the Organisation for
Economic Co-operation and Development, which is responsible for PISA (OECD, 2014),
the National Educational Technology Plan, and the National Research Council (NRC, 
2011) all acknowledge the benefits and potential of technology-based systems for the 
assessment of science inquiry (see chapter by Oranje, this volume). Computer-based 
environments like the one that we describe in this chapter along with others (see, e.g., 
Quellmalz et al, 2012; Ketelhut & Dede, 2006; Clarke et al, 2012; Leeman-Munk, Wiebe, 
& Lester, 2013; Timms, et al., 2012) are providing new possibilities for assessing science
and are now being considered as alternatives to traditional assessments of inquiry
(Behrens, 2009; Gobert et al., 2012, 2013; Pellegrino, Chudowski, & Glaser, 2001;
Quellmalz et al., 2009). 

In the following, we provide an overview of the design, data-collection, and data- 
analysis efforts for our digital assessment environment for scientific inquiry / science 
practices called Inq-ITS (Inquiry Intelligent Tutoring System; www.inqits.org).

Inq-ITS is an example of a system that was designed and instrumented to generate 
assessments of students’ science practices as they engage in rich, authentic inquiry tasks. 
Inq-ITS was explicitly designed to prioritize the assessment of inquiry rather than the 
learning of science, which many other curriculum-focused inquiry systems emphasize 
(cf., Linn & Hsi, 2000). Our work builds on the inquiry and assessment research of others 
and forges new ground for inquiry assessment and scaffolding of science with its 
application of data mining techniques (Gobert et al, 2013). 

We have organized this chapter into three main sections as follows. In the first 
section we review key literatures on science learning policy, students’ difficulties with 
science inquiry, and assessment design to provide a background for the design of the Inq- 
ITS system. In the second section, we describe how we used pilot data from four case 
studies with hands-on inquiry tasks for middle school students to better understand these 
difficulties and design various components of the Inq-ITS system to support students’ 
inquiry accordingly. In the third section, we describe how we used key computational 
techniques from knowledge-engineering and educational data mining to analyze data 
from students’ log files to (1) automatically score students’ inquiry skills, (2) provide 
teachers with fine-grained, rich, classroom-based formative assessment data on these 
practices, and (3) react in real time to scaffold students as they engage in inquiry. 

Foundations for the Design of the Inq-ITS System 
NGSS 

The NGSS, as the National Research Council’s new framework for K-12 Science 
Education, emphasize content learning as well as inquiry practices, as did its predecessors 
(see NSES, 1996). However, in the newest framework greater emphasis is placed on the 
rich integration of authentic practices in science with disciplinary content knowledge so 
that students will possess well-honed learning strategies that can be transferred in more
flexible ways. These inquiry practices are:

1. Asking questions (for science) and defining problems (for engineering)
2. Developing and using models
3. Planning and carrying out investigations
4. Analyzing and interpreting data
5. Using mathematics and computational thinking
6. Constructing explanations (for science) and designing solutions (for engineering)
7. Engaging in argument from evidence, and
8. Obtaining, evaluating, and communicating information


The NGSS also prescribe that American middle school students develop content
understanding of topics as shown in Table 1.


[INSERT TABLE 1 ABOUT HERE] 


Finally, the NGSS also describe six cross-cutting science concepts: 


1. Cause and effect
2. Scale, proportion, and quantity
3. Systems and system models
4. Energy and matter
5. Structure and function, and
6. Stability and change.


Given the importance placed on rich science inquiry practices that are aligned to the 
needs of the 21st century and the poor performance in science by American students in
studies such as PISA, there follows a need for better assessments that can provide more 
fine-grained data about ‘how’ students are learning (or ‘not’ learning, as appears to be the 
case for many students). 


Learning Sciences 
Models for learning. There were several related bodies of literature that we drew
on in the design of our system specifically. Briefly, the main literature includes: causal 
models (White & Frederiksen, 1990; Schauble et al, 1991; Raghavan & Glaser, 1995), 
visualization generation and comprehension (Gobert, 1994; Gobert & Frederiksen, 1988; 
Gobert, 2005; Kindfield, 1993; Larkin & Simon, 1987; Lowe, 1989), mental models 
(Gentner & Stevens, 1983; Johnson-Laird, 1983) and model-based learning (Gobert & 
Buckley, 2000; Harrison & Treagust, 2000), as well as the vast body of literature on 
students’ alternative conceptions (Pfundt & Duit, 1988; Driver, 1983) and difficulties 
with inquiry (cf., Kuhn, 2005). 

Most relevant to the theoretical framework that undergirds our system is how 
people learn with rich visual representations, specifically model-based learning; see 
Figure 1 for a graphical representation of this framework. The design of our system, its
scaffolds, and other assessment components, are based on model-based teaching and
learning (Gilbert, J., 1993; Gilbert, S., 1991; Gobert & Buckley, 2000). This, in turn,
prescribed the design of the interface as well as the design of the tools and widgets in 
Inq-ITS, which we describe further below. 

[INSERT FIGURE 1 ABOUT HERE] 

Briefly, model-based learning (Clement, Brown, & Zietsman, 1989; Gobert & 
Buckley, 2000; Gobert & Clement, 1999; Harrison & Treagust, 2000) is a theory of 
science learning that integrates basic research in cognitive psychology and science 
education. The tenets of model-based learning are based on the presupposition that deep 
understanding requires the construction of mental models of the phenomena under study, 
and that all subsequent problem-solving, inference making, or reasoning are done by 
“running” and manipulating these mental models (Johnson-Laird, 1983). We view mental 
models as internal cognitive representations used in reasoning (Brewer, 1987; Rouse & 
Morris, 1986); thus, we define model-based learning as a dynamic, recursive process of 
learning by constructing mental models of the phenomenon under study. It involves the 
formation, testing, and subsequent reinforcement, revision, or rejection of those mental 
models (Gobert & Buckley, 2000; Clement, 1993; Stewart & Hafner, 1991). This is 
analogous to hypothesis development and testing seen among scientists (Clement, 1989)
and also, we argue, a form of reasoning used in conducting inquiry.

In our theoretical framework, we also use D. Norman’s contemporary definition
of an affordance (Norman, 1983). That is, the ways in which one learns from a visual 
representation (i.e., constructs a mental model) and the features that the microworld 
affords the learner are dependent on the learner’s knowledge, skills, predispositions, and 
other characteristics. This is a more contemporary interpretation of the original use of the 
term affordance by the perceptual psychologist J.J. Gibson, who claimed that an 
affordance of a representation (or object) is independent of the user’s knowledge, skills, 
predispositions, and other characteristics (Gibson, 1977). Thus, the prior knowledge, 
epistemological frameworks, inquiry skills, and other variables related to the learner that 
s/he brings to bear when engaging in inquiry with the representation will play a role in 
the nature of the resulting mental model (Gentner & Stevens, 1983; Johnson-Laird, 
1983). 

The notion of what features provide affordances and for whom affected both our 
interface and scaffolding design. That is, since many students lack the necessary domain 
knowledge to guide their search processes through diagrams/models during learning 
(Lowe, 1989; Gobert, 1994; Gobert & Clement, 1999), our system and its scaffolds need 
to support learners so that their inquiry can be productive, resulting in the construction of 
rich mental models with which they can engage in sophisticated model-based reasoning. 

Students’ difficulties with scientific inquiry. As part of our design, we needed 
to know what difficulties middle school students have when conducting inquiry. We 
conducted a thorough literature review on students’ difficulties with inquiry to better 
understand the nature of these. This helped us to both concretize the sub-skills underlying 
the inquiry practices identified in the NGSS, as well as design tools and widgets to guide 
students in conducting inquiry. 

Many studies have shown that students have difficulty with inquiry learning in 
general. Students do not plan which experiments to run (Glaser et al., 1992) and can act 
randomly (Schauble, Glaser, et al., 1991). They have difficulty setting goals (Charney, 
Reder, Kusbit, 1990), monitoring their progress (de Jong et al., 2005; de Jong, 2006), and 
recording their progress (Harrison & Schunn, 2004). Furthermore, they concentrate more 
on executing procedures than what they might learn or infer from experimenting, and if
the data-gathering process is lengthy, they lose track of why they are collecting data
(Krajcik et al., 2000). Students have also been shown to demonstrate some difficulties in
terms of certain very specific inquiry skills targeted in the national frameworks for which 
we sought to design assessment metrics and scaffolds. Below, we briefly discuss findings from
prior research on five skills that were critical to the design of Inq-ITS.

First, when ‘generating hypotheses’ - referred to as asking questions in the NGSS 
- students may have difficulties choosing which variables to work with (Chinn & Brewer, 
1993; Klahr & Dunbar, 1988; Kuhn et al., 1995), including identifying the proper 
independent variable (Richardson, 2008). They may also have difficulty translating and 
understanding how theoretical variables and manipulable variables relate to each other 
(van Joolingen & de Jong, 1997; Glaser et al., 1992). 

Second, when ‘planning and carrying out experiments’, students may not test their
articulated hypotheses (van Joolingen & de Jong, 1991, 1993; Kuhn, Schauble, Garcia- 
Mila, 1992; Schauble, Klopfer, Raghavan, 1991) or may gather insufficient evidence to 
test hypotheses (Shute & Glaser, 1990; Schauble, Glaser et al., 1991) by running only one 
trial (Kuhn, Schauble, Garcia-Mila, 1992) or running the same trial repeatedly (Kuhn, 
Schauble & Garcia-Mila, 1992; Buckley, Gobert & Horwitz, 2006). They may also 
change too many variables (Glaser et al., 1992; Reimann, 1991; Tschirgi, 1980; Shute & 
Glaser, 1990; Kuhn, 2005; Schunn & Anderson, 1998, 1999; Harrison & Schunn, 2004; 
McElhaney & Linn, 2008, 2010), may run experiments that try to achieve an outcome 
(e.g., make something burn as quickly as possible), or may design experiments that are 
enjoyable to execute or watch (White, 1993), as opposed to actually testing a hypothesis 
(Schauble, Klopfer & Raghavan, 1991; Schauble, Glaser, Duschl, Schulze & John, 1995; 
Njoo & de Jong, 1993). 

Third, when ‘analyzing and interpreting data’ students may show confirmation 
bias (i.e., they will not discard a hypothesis based on negative results) (Klayman & Ha, 
1987; Dunbar, 1993; Quinn & Alessi, 1994; Klahr & Dunbar, 1988). They
may draw conclusions based on confounded data (Klahr & Dunbar, 1988; Kuhn, 
Schauble & Garcia-Mila, 1992; Schauble, Glaser, Duschl, Schulze & John, 1995), they 
may not relate outcomes of experiments to theories being tested (Schunn & Anderson, 
1999), and they may reject theories without disconfirming evidence (Klahr & Dunbar, 
1988). They may have difficulty linking data back to hypotheses (Chinn & Brewer, 1993; 
Klahr & Dunbar, 1988; Kuhn et al., 1995), may have difficulty interpreting data displays
like graphs, or may have difficulty interpreting important differences between related 
variables (e.g., time to evaporate vs. rate of evaporation) shown in data displays (cf. 
McDermott, Rosenquist, & van Zee, 1987). 

Fourth, when ‘constructing explanations’, students may be overly reliant on 
theoretical arguments as opposed to evidence (Kuhn, 1989, 1991; Kuhn, Katz & Dean, 
2004; Ahn et al., 1995; Ahn & Bailenson, 1996; Brem & Rips, 2000; Schunn & 
Anderson, 1999), they may struggle to provide appropriate evidence for their claims 
(McNeill & Krajcik, 2007), or they may analyze data so as to protect prior beliefs, which 
can lead to faulty causal attribution (Kuhn et al., 1995; Keselman, 2003; Kuhn & Dean, 
2004). 

Finally, when ‘communicating findings’, students may have difficulty articulating 
and defending claims (Sadler, 2004). They may tend to focus on what they did as 
opposed to what they found out, may not link data and conclusions, and may not relate 
results to their own knowledge/questions (Krajcik, et al., 1998). They may also struggle 
to provide reasoning to describe why evidence supports claims (McNeill & Krajcik, 
2007). 

Assessment System Design 

Intuitively speaking, it makes sense to design educational assessments based on 
the vast literature from the learning sciences summarized in “How People Learn” 
(Bransford, Brown, & Cocking, 2000). Briefly, the literature in that volume spans from 
the onset of the information-processing perspective on learning circa 1960 to present day 
and has been extremely informative regarding the role of prior knowledge in learning, the 
nature of mental models and their role in reasoning, as well as domain-specific content 
learning and teaching, including science (Duschl, Schweingruber, & Shouse, 2007). 

In a white paper report that extended “How People Learn”, Pellegrino (2009) 
outlined some key principles that are very noteworthy in guiding the design of 
assessments of NGSS practices so that the resulting data can be used to “educate and 
improve student performance, rather than merely to audit it” (Wiggins, 1998, p. 7). In 
brief, these principles emphasize that assessments should be integrated with
curricular/instructional needs including domain-subject matter learning (for science, this
includes content and practices) and that assessments need to be framed by current
theories and data about student cognition and learning, including learning progressions 
and expert-novice differences, in addition to students’ prior knowledge. 

From a system designer’s perspective, the practical assessment problem becomes 
the following: how do we take the policy documents about what science literacy is and 
what we expect students to be able to know and do in science and use these to inform the 
design and development of a valid and reliable system capable of generating 
performance-based assessments? How do we design a system that is scalable to large 
numbers of users so that science literacy on a broad scale can be realized? It is clearly 
necessary that the new assessment permit better inferences about students’ knowledge, 
skills, and inquiry practices when compared to more traditional multiple choice tests,
while still providing evidence of both validity and reliability, while being neither too 
expensive nor too laborious to construct. 

Properties of assessment systems. Leighton & Gierl (2011) offer three indices 
for evaluating assessment items/systems, namely granularity, measurability, and 
instructional relevance. Briefly, granularity refers to the depth and breadth of the 
knowledge and skills being measured by the system. Specifically, to permit inferences 
about what students know, underlying cognitive models must be 
described/collected/reported at a level of specificity that will provide meaningful 
information about students’ performance so that teachers (or the system itself) can 
provide necessary feedback. If developed at the right level of granularity, a teacher can, 
in turn, use these formative data to inform their instruction. Alternatively, support in the 
form of scaffolding can be done in real time via an automated pedagogical agent as in the 
Inq-ITS system that we describe in this chapter. 

The second criterion for evaluating an assessment is its measurability in order to 
link learning with assessment. Specifically, the knowledge and skills in the cognitive 
model must be described in a way that would allow a developer to create a test item or 
task to measure that particular knowledge or skill. Later in this chapter we describe how 
we used information from four case studies to develop articulations of the knowledge and
skills for scientific inquiry practice for our Inq-ITS system.

The third criterion for evaluating assessments is instructional relevance. That is,
in developing a cognitive model, the knowledge and skills must be instructionally 
relevant and meaningful to the relevant group of educational stakeholders such as 
teachers, superintendents, and policy-makers. For example, teachers need highly 
actionable data that are easy to understand and use in real time to inform instruction 
(Huff & Goodman, 2007). Instructional relevance is generally related to grain size. For 
example, when data are derived from students’ logs - as is the case with our Inq-ITS
system - the data must be aggregated from their finer-grained level up to a level that is 
instructionally relevant for teachers for the purposes of instruction and scaffolding. 
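
As a minimal illustration of this aggregation step, the sketch below rolls fine-grained evidence up to one proportion per student and inquiry practice. The record layout, the sub-skill names, and the `summarize_for_teacher` helper are hypothetical and chosen only for illustration; they are not the Inq-ITS log format or its scoring method.

```python
from collections import defaultdict

# Hypothetical scored evidence: (student, practice, sub_skill, demonstrated).
# All names and values here are illustrative assumptions.
EVIDENCE = [
    ("s01", "hypothesizing", "identifies_iv", True),
    ("s01", "hypothesizing", "identifies_dv", True),
    ("s01", "experimenting", "controls_variables", False),
    ("s01", "experimenting", "tests_hypothesis", True),
    ("s02", "hypothesizing", "identifies_iv", False),
    ("s02", "experimenting", "controls_variables", False),
]


def summarize_for_teacher(evidence):
    """Roll sub-skill evidence up to the proportion demonstrated per practice."""
    # (student, practice) -> [number demonstrated, number observed]
    counts = defaultdict(lambda: [0, 0])
    for student, practice, _sub_skill, demonstrated in evidence:
        tally = counts[(student, practice)]
        tally[0] += int(demonstrated)
        tally[1] += 1
    return {key: hits / total for key, (hits, total) in counts.items()}


summary = summarize_for_teacher(EVIDENCE)
for (student, practice), proportion in sorted(summary.items()):
    print(f"{student} {practice}: {proportion:.2f}")
```

A proportion per practice is only one possible grain size; the same tallies could instead be reported per sub-skill or per class, depending on what a teacher needs to act on.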

Evidence-centered design. Despite the rich theoretical frameworks from the 
learning sciences and vast amount of findings on how people learn science, the 
development of resources to assess science is lagging behind (Leighton & Gierl, 2011). 
Specifically, the broad and deep literature base from the learning sciences about what 
students know and the types of knowledge they use in reasoning should be used to guide 
the design of test items and tasks for assessment. If designed in this way, there is greater 
potential to strengthen validity arguments regarding the inferences that can be made 
about students’ knowledge from such items (Leighton & Gierl, 2011). 

Delving a level deeper in terms of its specificity for guiding the development of 
assessments for science in particular, Mislevy and colleagues (e.g., Mislevy et al., 2012) 
thus proposed the evidence-centered design (ECD) framework. This framework describes 
how the analysis of key practices in a domain can be used to inform the design of 
assessments for that domain. Domain analysis is similar in spirit to task analysis as 
described in the information-processing literature (Newell & Simon, 1972), but ECD
has the explicit goal of assessment design, whereas task analysis is typically used to
characterize learning (Newell, 1990). 

In calibrating a system for assessment purposes, Mislevy et al. (2012) explicitly
state how domain analysis and subsequent domain modeling processes are used to inform 
the design of the conceptual assessment framework within the ECD framework. In the 
context of Inq-ITS, scientific inquiry practices to be assessed are specified in a student 
model and then connected to a task model that specifies features of tasks as well as
questions that would elicit the evidence of learning. Observable behaviors then result in
indicators that are used for evidence identification and accumulation processes within an
evidence model that specifies the nature of student responses that indicate levels of
proficiency; for a full description of how our ECD models were derived see Gobert et al. 
(2012). 

We derived our cognitive model for the sub-skills underlying the inquiry practices 
from our think-aloud data from the four case studies, which we discuss in the next 
section, and the previously reviewed literature on students’ difficulties with scientific 
inquiry. In Inq-ITS, the task model includes the activities conducted in the microworld
that reveal students’ proficiencies for each sub-skill of interest. Finally, the evidence
model specifies how one uses work products (i.e., end-state products) and processes (i.e.,
actions/behaviors as indicated in their log files) to assess students’ inquiry practices.
These data are then aggregated and analyzed to yield performance indicators that are used 
as evidence of students’ proficiencies for each inquiry practice and their respective sub- 
skills. 
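
The relationship among the three ECD models can be sketched in code. The example below is a simplified illustration under stated assumptions: the class names, the sub-skill label `controls_variables`, the trial values, and the hand-written evidence rule are all hypothetical, and the rule is not the detector used in Inq-ITS. It shows a task model listing the manipulable variables of a ramp-style task, an evidence rule that inspects logged trials for a controlled comparison, and a student model that accumulates the result.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass(frozen=True)
class TaskModel:
    """Features of a hypothetical microworld task."""
    independent_variables: tuple
    dependent_variable: str


@dataclass
class StudentModel:
    """Accumulated evidence per sub-skill (True means demonstrated)."""
    evidence: dict = field(default_factory=dict)

    def record(self, sub_skill, demonstrated):
        self.evidence.setdefault(sub_skill, []).append(demonstrated)


def differing_variables(trial_a, trial_b, variables):
    """List the variables whose values differ between two trials."""
    return [v for v in variables if trial_a[v] != trial_b[v]]


def has_controlled_comparison(trials, task, target):
    """Evidence rule: some pair of trials differs only in the target variable."""
    return any(
        differing_variables(a, b, task.independent_variables) == [target]
        for a, b in combinations(trials, 2)
    )


task = TaskModel(("steepness", "run_length", "ball", "surface"), "distance")

# Hypothetical process data: one dict per trial a student ran.
trials = [
    {"steepness": "low", "run_length": "short", "ball": "rubber", "surface": "smooth"},
    {"steepness": "high", "run_length": "long", "ball": "rubber", "surface": "smooth"},
    {"steepness": "high", "run_length": "short", "ball": "rubber", "surface": "smooth"},
]

student = StudentModel()
student.record(
    "controls_variables", has_controlled_comparison(trials, task, "steepness")
)
print(student.evidence)
```

Here the first and third trials differ only in steepness, so the rule counts as positive evidence for that sub-skill; the first two trials alone would not, because they also differ in run length.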

To create automated evidence-based assessment summaries within a complex 
system like Inq-ITS, it is critical to know how to leverage modern computational 
techniques in order to analyze the rich log file data that are generated and captured as 
students engage in scientific inquiry tasks. Although there are many on-line learning 
environments for science, few are leveraged to assess the skills that they were designed to 
foster (Quellmalz et al., 2009). Computational techniques adapted or adopted from 
domains such as computer science or educational data mining in particular are necessary 
to handle the analysis of data both in terms of the grain-size and volume of log files that 
are generated in rich interactive systems (Behrens, 2013; Gobert et al, 2013; Mislevy et 
al, 2012). 

In line with ECD thinking, it is critical to concretize the sub-skills underlying the 
inquiry practices at a level of granularity that informs decisions about what kinds of data,
along with what kinds of computational techniques, are needed to generate assessment
metrics of inquiry practices. This design work needs to be done before the computational
techniques can be developed for assessment rather than post hoc once the environments 
are already created. In the third section we describe how we analyzed our data by
leveraging both knowledge-engineering and educational data mining techniques. The
resulting computational techniques were designed to handle both the fine grain size of
our data and its large volume in order to assess students and scaffold them in real time
(see Gobert et al., 2012, 2013, for a thorough description of this approach).

Case Studies with Think-aloud Components 

Using the above literatures as a basis, we designed and conducted a series of one- 
on-one case studies with students using think aloud protocols (Richardson, 2008). Think- 
aloud protocols are assumed to present a "trace" of the learner's cognitive processes in 
that the object being described as a person thinks out loud is assumed to be information/ 
knowledge that is currently being attended to in execution of the task (Ericsson & Simon, 
1980). Methods used to analyze think aloud data can be very fine-grained. Some methods 
of protocol analysis are done at the propositional (i.e., basic idea unit) level (see 
Frederiksen, 1975; 1986) or at the clause level (Chi, 1997), providing key information 
about the semantic units underlying thinking. Think-aloud data have been used to provide 
information about particular facets of task performance that can be used to develop 
canonical models for software development (Ericsson & Simon, 1980). 

In order to inform the initial design of our Inq-ITS environment, four case studies 
with think-aloud components were conducted in our partner middle schools. Across the 
four case studies, we sought to (1) characterize how middle school students naturally 
approach scientific inquiry tasks, (2) develop a set of scaffolds for inquiry to be 
integrated into a technical environment, and (3) determine the effectiveness of various 
prompts and scaffolding tasks at fostering inquiry practices. 

Case Study 1 

The first case study was designed to characterize how students naturally approach 
an inquiry task, to get a sense of what the students already understood in terms of inquiry, 
and to better understand common areas of weakness with respect to both inquiry in 
general as well as areas of weakness within designing controlled experiments specifically 
(Chen & Klahr, 1999). In this case study, fourteen middle school students were
randomly selected from a range of class levels, including high, average, and
lower-performing students. Each student was tested on an individual basis. Students were
presented with a physical ramp apparatus whose features (steepness, run length, the type
of ball, and surface) could be changed. They also were given blank pieces of paper and a
pencil to record their findings.

Students were first asked which features they thought they could change on the 
ramp that would affect how far the ball would roll. They were then told that these 
features were called variables. Next, they were asked to state a hypothesis and run an 
experiment to test their hypothesis on how the steepness of the ramp affects how far the 
ball would roll. After running the experiment, the students were asked to explain, based 
on their data, how steepness affects how far the ball rolls. When the students finished 
testing how steepness affects how far the ball rolled, they moved on to test, in turn, the
effects of run length, the type of ball, and surface on how far the ball rolled. Then they were
asked to reflect back on the first experiment and say what they would have done 
differently if they were to run it again. This question was asked to test if they had 
acquired meta-knowledge about how to conduct controlled experiments. 

The prompts in this case study, which were intended to be a “gentle guide” toward 
improving the student’s strategies, were not prewritten. If a student continued to use the 
same type of inquiry strategy, the next prompt given was slightly more direct. For 
example, if a student was demonstrating “buggy” inquiry, the student was given a prompt 
and then given more time to interact with the apparatus to revise his/her strategy and
re-run the experiment. The approach of providing progressively more direct scaffolds was
later incorporated into the Inq-ITS system.

The data from this study included all of the notes and tables made by each student 
during the experiment. Additionally, voice data were analyzed to determine which inquiry
skills the students struggled with and which prompts were helpful in improving inquiry 
skills. The data showed that, although students were generally fairly good at articulating a 
hypothesis (e.g., they included an independent variable, a dependent variable, and 
specified a relationship between them), they showed difficulties with other inquiry 
practices, many of which were previously described in the inquiry literature reviewed 
above. 

First, students did not naturally seek to record their data and needed both 
prompting and a great deal of help in recording data. Specifically, many did not know 
how to record the data in columns with the values for each independent variable in one
column and the resulting dependent variable in another column. Second, with respect to
designing and conducting experiments, most students did not target one variable by
changing only that one and keeping all others the same. Many students also collected data 
from a single trial, collected data for the same trial repeatedly, or collected data merely to 
reaffirm their initial hypothesis without considering alternative explanations. When 
describing their findings, students did not include data in their explanations; that is, they 
did not provide evidence for their claims with data from their table(s). It was notable,
however, that with some experimenter scaffolding, students’ performance at making a table
improved across trials. Since recording data is a graphical literacy skill as opposed to a
science inquiry skill, students’ difficulties with recording their data led us to
appreciate the importance of providing an auto-populated data table for students in the
design of Inq-ITS.

Case Study 2

In a second case study, we collected data from demographically-similar students 
attending one of our partner schools, a lower SES school in Central Massachusetts. All 
materials, including the physical ramp apparatus, data collection, and recording 
procedures were identical to case study 1. However, in this study we administered a short
pretest and posttest of inquiry skills, provided students with a lab book, and used
more formalized inquiry prompts as part of the data collection procedure. As in case study
1, we found that the students did not record findings without prompting to do so. When
they had trouble with conducting controlled trials, the experimenter showed them some 
data in the lab book and students were able to pinpoint issues in the collection procedure 
of these data. However, when conducting their own trials, they typically failed to conduct 
controlled trials when collecting data. When analyzing data, students again struggled with 
writing explanations for their data. Data from this study further demonstrated the 
complexity and “thorniness” of students’ difficulties with conducting controlled trials. 
Specifically, students do not tend to conduct contrasting trials sequentially, making the 
assessment of their knowledge of how to design controlled experiments very difficult. 
Later in the chapter we describe, in brief, how our algorithms do this, as well as describe 
the need to refine our scaffolds for this critical inquiry practice within the Inq-ITS 
system. 


Case Study 3

In the third case study, again with the same student demographic, we used a
simulated ramp environment (i.e., a virtual replica of the one students used in case studies
1 and 2). Similar to case study 1, the goal here was to see what the students did naturally 
in terms of inquiry but within a virtual, simulated environment. In addition, the tasks 
were slightly more structured than in case study 2 because we determined that a 
structured approach was more effective in teaching how to design controlled experiments 
(Klahr & Nigam, 2004). In addition, data tables that included blank rows and columns 
were provided to the students. Voice data and videos of students’ interactions with the 
simulated ramp apparatus were collected and analyzed for each student. 

The data from this study were very informative with respect to the breadth and 
depth of students’ difficulties with inquiry. Specifically, students again demonstrated 
difficulties with recording data and all students needed to have the columns of the table 
set up for them by the experimenter so that they could correctly record their data. When 
collecting data, they repeated trials, did not collect contrasting trials, did not collect trials 
sequentially, and did not run controlled trials. When interpreting results, they did not 
attend to the appropriate data. Of interest to data interpretation, when students were asked 
to compare their original tables to the ones set up for them by the experimenter, the 
students realized that they had changed too many variables, making it harder to see the effect
on the variable they were trying to test. In the new tables, which were set up for them, 
the outcome was more salient to the students. Lastly, students had difficulties in 
communicating their findings in that their conclusions were not based on their data. 

Case Study 4 

In the fourth case study, students were drawn again from the same demographic 
sample. This case study was similar to case study 3 in that the simulated ramp
environment was used; however, we also used a lab book similar to that in case study 2.
Given earlier results showing consistent and pronounced difficulties with conducting
controlled experiments, the lab book included a direct explanation of how to collect
unconfounded data, since direct instruction on this skill has been shown to be effective
(Klahr & Nigam, 2004). This was written so that the student did not need any assistance
from a human tutor; this, we deemed, would help us in beginning to formalize the
prompts that would be incorporated into Inq-ITS.

Again, we found that students demonstrated problems with inquiry. Specifically,
when conducting experiments, they did not collect enough data and only showed a small 
improvement in conducting controlled experiments. Moreover, they had problems with 
interpreting data and communicating findings as they had in the previous case studies. 
For example, even though students did not have the data needed to support their 
conclusion, they insisted that they had “discovered the answer” and that their data 
“proved it”. For some students it seemed too obvious that if the ramp is steeper the
ball will roll further (e.g., one student commented that “the higher the ball, the faster it
goes”). When asked if they saw those results in their table, all students answered “yes”;
however, they did not give a deeper explanation of how they had demonstrated that the 
higher the steepness of the ramp, the further the ball will go. 

Summary of Case Studies 

All told, the information gathered through think-aloud activities in the case studies
helped us to better understand the breadth, depth, and pervasiveness of students’
difficulties across all of the inquiry practices outlined in the NGSS, consistent with
previous findings in the literature. We also used our think-aloud data to identify the level
of granularity needed to conceptualize and operationalize the sub-skills underlying
inquiry practices. Moreover, we used analyses of students’ think-aloud data to
characterize students’ “natural” inquiry processes with no/minimal support (e.g., graphs,
tables, widgets, scaffolds) in order to better understand students’ needs for these tasks. 
These kinds of information identified in the hand-coding of our think-aloud protocols 
were valuable to the development of the algorithms needed to measure the sub-skills of 
inquiry and to provide appropriate scaffolds for learning within Inq-ITS. We now 
describe the characteristics of this system in more detail.

The Inq-ITS System

As discussed previously, Inq-ITS is a rigorous, technology-based learning 
environment that assesses and scaffolds middle school students as they engage in inquiry 
in Earth, Life, and Physical Sciences. The system can be run either in “pure assessment 
mode” or in “scaffolding mode”, in which our virtual agent, Rex, jumps in to support 
students in real time when needed. In the following section, we describe the key features
of the Inq-ITS system.

Microworlds

Inq-ITS uses microworlds (Papert, 1980) to engage students in scientific inquiry.
Microworlds are computerized representations of real-world phenomena whose 
properties can be inspected and changed (Pea & Kurland, 1984; Resnick, 1997). 
Microworlds provide authentic inquiry opportunities because they share many features
with real apparatus for “doing science” (Gobert, in press), thereby providing perceptual
affordances for the learner. In turn, these perceptual affordances can provide leverage for 
building rich conceptual knowledge (Gobert, 2005). With a microworld a learner can 
pose questions, plan and carry out a virtual experiment with a simulation by collecting 
data, then analyze their data, and then communicate their findings in the form of a 
scientifically warranted explanation. Inq-ITS microworlds are used for performance 
assessment of students’ inquiry practices in that they: (a) are instrumented to log all 
students’ interactions, (b) leverage real time analyses of log files based on knowledge- 
engineering and educational data mining, (c) provide assessment metrics to researchers 
and teachers on each inquiry skill of interest and, (d) can scaffold students’ inquiry 
processes in real time (Gobert et al., 2013). 

In developing microworlds for Inq-ITS, we surveyed the science education and 
learning sciences literature for students’ content misconceptions for each topic across 
Earth, Life, and Physical Sciences to determine which variables, domain-specific 
properties, and domain-specific representations to include in each microworld so that 
students could fully engage with the content in authentic ways to more deeply understand 
the topic and hone their inquiry practices in these domain-specific contexts. For example, 
in the domain of ‘state change’, a common misconception in middle school is that as the 
amount of a substance increases, the temperature at which it will boil also increases. 
Mislevy et al. (2012) refer to this process as the domain analysis in the ECD lifecycle of 
assessment design and delivery; the information-processing literature refers to this more 
generally as task analysis (Newell, 1990). 

An integral part of the Inq-ITS system and associated microworlds is the inclusion 
of inquiry widgets, as mentioned before. These widgets are important in that they 
scaffold students in conducting various steps of inquiry, but are also the basis upon which
we collect our performance data on students’ inquiry practices. Our inquiry widgets were
designed in accordance with the learning sciences and science education literature on
students’ difficulties in conducting inquiry.

For example, the ‘question asking/hypothesis’ widget was designed to externalize 
the structure of students’ hypotheses using independent and dependent variables and the 
relationships between them. Using this widget, students’ questions/hypotheses are 
generated in the form of sentences. The resulting data are logged and are then used to 
generate metrics about the sub-skills of inquiry practices such as whether the student has 
included an independent variable, a dependent variable, and a relationship between them. 
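
To make the idea of sub-skill metrics concrete, the following is a minimal sketch of how a rule might score a logged hypothesis. It is an illustration only, not the Inq-ITS implementation: the variable names, the `Hypothesis` record, and the `score_hypothesis` function are all hypothetical, and the ramp variables are borrowed from the case studies purely as an example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical variable sets for a ramp task (names are illustrative only).
INDEPENDENT = {"steepness", "run length", "ball type", "surface"}
DEPENDENT = {"distance rolled"}
RELATIONSHIPS = {"increases", "decreases", "does not change"}


@dataclass
class Hypothesis:
    """What a student selected in the hypothesis widget (fields may be empty)."""
    iv: Optional[str]
    dv: Optional[str]
    relationship: Optional[str]


def score_hypothesis(h: Hypothesis) -> dict:
    """Score each sub-skill separately so a report can show them individually."""
    return {
        "identify_iv": h.iv in INDEPENDENT,
        "identify_dv": h.dv in DEPENDENT,
        "states_relationship": h.relationship in RELATIONSHIPS,
    }


# A student who placed a dependent variable in the independent-variable slot.
scores = score_hypothesis(Hypothesis("distance rolled", "distance rolled", "increases"))
print(scores)
# {'identify_iv': False, 'identify_dv': True, 'states_relationship': True}
```

Scoring each sub-skill as a separate Boolean, rather than producing a single hypothesis score, is what would allow a report to show which specific sub-skill a student has not yet demonstrated.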

The ‘data interpretation/analysis’ widget provides a structure for the student to 
interpret their data after the experimental trials are completed. Similar to the hypothesis 
widget, the data interpretation/analysis widget presents a way for the student to create 
statements about the relationship between the independent and dependent variables from 
their trials and to warrant their claims by selecting their trials to either support or refute 
their hypothesis. A full description of the widgets can be found in Sao Pedro et al. (2011), 
and Gobert et al. (2012, 2013). 

Types of Scaffolding 

As described in our theoretical framework, we believe that middle school 
students, many of whom lack adequate prior content knowledge and have difficulties 
with inquiry as previously described, need guidance in conducting scientific inquiry. The 
degree of structure used to guide students’ general and science-specific inquiry activities 
in learning environments is a topic that was hotly debated in the field of science 
education in the recent past (e.g., Kirschner, Sweller, & Clark, 2006; Hmelo-Silver, 
Chinn, & Duncan, 2006). 

As previously mentioned, Papert’s conception of inquiry with microworlds is 
more open-ended in terms of degree of pedagogical guidance (Papert, 1980; 1993) than 
inquiry with microworlds in our system. Inq-ITS allows a moderate degree of student 
choice, less choice than in purely exploratory learning environments (Amershi & Conati, 
2009; Papert, 1980; 1993) but more choice than in classic model-tracing tutors 
(Koedinger & Corbett, 2006) or constraint-based tutors (Mitrovic et al., 2001). 
Specifically, in Inq-ITS, students’ scientific inquiry is guided in three ways: through (1)
general scaffolding afforded by the system user interface, (2) teacher scaffolding, and (3)
adaptive scaffolding by our pedagogical agent Rex, a cartoon dinosaur. We discuss each
form of scaffolding briefly in the following.

Inq-ITS user interface scaffolding. Students’ inquiry in Inq-ITS is guided by the 
order in which information is provided to learners as well as by the widgets and support 
tools provided to learners. For example, we begin by suggesting that they develop and 
ask a question using a set of independent and dependent variables. Students then plan and 
carry out an experiment by collecting data, interpret their data, provide evidence in the 
form of warrants for their claims, and communicate their findings. Inq-ITS has a progress 
bar on the top of the screen to support them in knowing what phase of inquiry they are 
presently in, which is critical to students’ monitoring (de Jong et al., 2005; de Jong, 
2006). Additionally, the artifacts that students generate by using widgets make visible 
and salient both the products and processes of inquiry for the learner in order to support
students’ meta-level understanding of inquiry. 

Teacher-led scaffolding. When in pure assessment mode, assessment of 
students’ inquiry practices is done in a stealth manner (Shute, 2011); that is, 
unobtrusively without taking time from instruction. Formative data collected on students’ 
inquiry practices are provided in real time directly to teachers via an integrated 
assessment report shown in Figure 2, which displays information at both the class-level 
and individual student level. The report is generated via our knowledge-engineered rules 
and data mined algorithms that we describe further below. 

[INSERT FIGURE 2 ABOUT HERE] 

Recently, we completed the development of an alerting platform called Inq-Blotter
(Sao Pedro, Gobert, & Betts, 2015) that automatically alerts teachers on their
mobile devices as to which students are having difficulties with inquiry and on which
inquiry practices and sub-skills. With these data literally ‘in hand’, teachers can
walk around the room and provide assistance to students as they need it, when 
scaffolding is most critical to learning (Koedinger & Corbett, 2006). From these data, a 
teacher might decide to stop the entire class, say, if many do not understand what an 
independent variable is or are not conducting controlled trials, or s/he might decide to go
over to help individual students in real time if there are only a few students having
difficulty with a particular skill. Our reports and alerts are designed to be highly readable
and to identify for the teacher the students who most need help while the teacher monitors
the overall class’ performance. These reports and alerts were designed in collaboration
between our user experience designer, graphic designers, and our partner teachers using 
an iterative design process of conceptual sketches and mock-ups. We then interviewed 
teachers as to the usability of these reports and the levels of aggregation that they needed 
in the reports and tweaked the reports as needed. 
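
The class-level versus individual-level decision described above can be sketched in a few lines. This is a hypothetical illustration, not the logic of the actual reports or of Inq-Blotter: the data layout, the function names, and the 50% threshold for suggesting a whole-class intervention are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical auto-scored results: student -> skill -> demonstrated (True/False).
Scores = Dict[str, Dict[str, bool]]


def struggling_by_skill(scores: Scores) -> Dict[str, List[str]]:
    """Group the students who have not yet demonstrated each skill."""
    result: Dict[str, List[str]] = defaultdict(list)
    for student, skills in scores.items():
        for skill, ok in skills.items():
            if not ok:
                result[skill].append(student)
    return dict(result)


def suggest_action(struggling: List[str], class_size: int, threshold: float = 0.5) -> str:
    """Suggest whole-class review when a large share of the class struggles.

    The 0.5 threshold is an arbitrary assumption for illustration.
    """
    share = len(struggling) / class_size
    return "whole-class" if share >= threshold else "individual"


scores = {
    "Ana": {"identify_iv": False, "controlled_trials": True},
    "Ben": {"identify_iv": False, "controlled_trials": True},
    "Cy": {"identify_iv": True, "controlled_trials": False},
}
by_skill = struggling_by_skill(scores)
for skill, students in by_skill.items():
    print(skill, students, suggest_action(students, class_size=len(scores)))
# identify_iv ['Ana', 'Ben'] whole-class
# controlled_trials ['Cy'] individual
```

In practice the decision remains the teacher's; a sketch like this only shows how the same per-student scores can feed both the class-level and the individual-level views.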

Automated adaptive scaffolding via Rex. Assessment can be done either 
without Rex or with Rex. In the full embodiment of our system with Rex’s scaffolding 
capacity, our goal was to provide the optimal degree of guidance so that students’ inquiry 
skills could be honed in real time, when real time feedback is most effective (Koedinger 
& Corbett, 2006). By running Inq-ITS in scaffolding mode, assessment is seamlessly 
integrated with instruction so that skills can be developed and assessed in the rich 
contexts in which they are developing (Mislevy et al., 2003). 

Rex, our pedagogical agent, provides scaffolds on a particular inquiry practice 
when - and only when - our system detects that the student needs this type of help via our 
data-mined algorithms (for a fuller description see Gobert et al., 2013). In this way, 
Vygotsky’s notion of scaffolding within the zone of proximal development (1978) can be 
realized. This automated scaffolding approach is used instead of on-demand help in 
which students explicitly ask for support (e.g., Anderson et al., 1995) because on-demand 
help requires metacognitive knowledge (Aleven & Koedinger, 2000; Aleven, McLaren, 
Roll & Koedinger, 2004). Given that students have difficulty monitoring their progress 
(de Jong et al., 2005), we had empirical evidence to believe that students would be 
unaware when they are in need of help during inquiry. 

Inq-ITS has four types of automated scaffolds embodied by Rex: (1) orienting 
scaffolds to help students monitor where they are in the inquiry process (de Jong et al., 
2005), (2) conceptual scaffolds to provide students conceptual information needed for the 
current task (e.g., Rex may explain why controlled experiments are important for testing a 
hypothesis), (3) procedural scaffolds to help students on the current task (e.g., Rex may 
instruct the student to “construct a controlled trial, relative to your last trial run” or to 
“design experiments by changing only one variable while keeping the others the same”),
and (4) instrumental scaffolds to provide the student direct instruction as to what to do on
the current task (e.g., Rex may break down strategic help in a step by step fashion,
providing a type of worked example from which to learn) (Koedinger & Aleven, 2007). 
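
One way to picture how scaffolds could become progressively more direct is the sketch below. It assumes, purely for illustration, that the four scaffold types are delivered in the order listed and that the most direct type is repeated once the sequence is exhausted; the chapter does not specify this ordering policy, and the `ScaffoldTracker` class is hypothetical.

```python
from typing import Dict, Optional

# Assumed ordering from least to most direct (an assumption for this sketch).
SCAFFOLD_LEVELS = ["orienting", "conceptual", "procedural", "instrumental"]


class ScaffoldTracker:
    """Tracks, per skill, how many scaffolds a student has already received."""

    def __init__(self) -> None:
        self._given: Dict[str, int] = {}

    def next_scaffold(self, skill: str, needs_help: bool) -> Optional[str]:
        """Return the next scaffold type, or None when no help is needed."""
        if not needs_help:
            return None
        count = self._given.get(skill, 0)
        # Stay at the most direct level once the sequence is exhausted.
        level = SCAFFOLD_LEVELS[min(count, len(SCAFFOLD_LEVELS) - 1)]
        self._given[skill] = count + 1
        return level


tracker = ScaffoldTracker()
for _ in range(5):
    print(tracker.next_scaffold("controlled_trials", needs_help=True))
# orienting, conceptual, procedural, instrumental, instrumental
```

The `needs_help` flag stands in for the output of the detectors described in this chapter; the sketch returns no scaffold at all when that flag is false, mirroring the "when, and only when" policy.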

In short, to us, the goal of providing scaffolded support for inquiry practices via 
the teacher or Rex in the form of orienting messages or conceptual/procedural strategies 
is not equivalent to direct instruction of formulas and rote science facts as characterized 
in Kirschner et al. (2006). In Inq-ITS, we scaffold students’ inquiry practices since (1)
these are not likely to develop naturally (Kuhn, 2005), (2) students can become lost and
frustrated and their confusion can lead to misconceptions if they are not scaffolded
(Brown & Campione, 1994), (3) teachers spend considerable time scaffolding students’
procedural skills (Aulls, 2002), and (4) there are many “lost” opportunities for learning 
and assessments if students are not guided properly via scaffolds (e.g., if students are not 
testing their hypothesis or if their data are confounded, all subsequent inquiry tasks such 
as data interpretation, warranting claims, and developing explanations are moot since 
their data do not afford the possibility of successfully completing these tasks due to 
“buggy” data collected during the data collection phase of inquiry).

Illustrative Vignette 

To make the previous ideas more concrete, we now present a small vignette in the 
context of a ‘states of matter’ microworld that students use to conduct inquiry to 
determine the effects of the independent variables (e.g., level of heat, amount of 
substance) on the dependent variables (e.g., time to melt, temperature when melted); see 
Figure 3 for a screenshot of this microworld. 

[INSERT FIGURE 3 ABOUT HERE] 

After the student has had some time to explore the microworld, the student uses 
the ‘question asking/hypothesis’ widget to generate a hypothesis. When the student 
finishes this, it is checked for correctness using a knowledge-engineered rule 
(Feigenbaum & McCorduck, 1983). If the student incorrectly enters a dependent variable 
in place of an independent variable, this particular sub-skill (i.e. to distinguish 
independent from dependent variables) is auto-scored as ‘incorrect’ and the teacher report 
shown earlier in Figure 2 will be auto-populated with this information so the teacher will 
always know the status of his/her students’ inquiry skills; such timely feedback is critical
to deep learning.

Students then design and conduct their experiment by first selecting variables to
manipulate and then running a simulation. This provides students additional opportunities 
to demonstrate their understanding of independent and dependent variables as well as 
their understanding of how to test a hypothesis using controlled trials. As students collect 
data by running trials with the simulation, it is highly likely that some students will not 
design controlled experiments whereby one variable is systematically targeted across 
trials and all other variables are held constant (Chen & Klahr, 1999). Once a number of 
trials have been completed, our data-mined assessment algorithm is able to assess 
whether students are successfully demonstrating this skill (Sao Pedro, Baker, & Gobert, 
2012). If not, this information is automatically updated in the teacher report. As the
teacher helps one student, our algorithms continue to update the report and alerts.

Once sufficient data are gathered to support or refute the hypothesis, students interpret
their data and warrant their claims using the ‘data interpretation/analysis’ widget. Again,
as in hypothesis formation, a knowledge-engineered rule checks the student’s interpretation
both for correctness and for whether they have selected the correct data to warrant their
claim. The report indicates which students are most in need of help on which skill(s).

Generation of Assessment Metrics

The log files of students’ actions collected unobtrusively and in situ within the 
Inq-ITS system provide a fertile basis upon which to generate performance-based 
assessments of rich inquiry processes (Clarke-Midura, Dede, & Norton, 2011). 
Additionally, the resulting evidence about key student competencies from log files can be 
connected to the evidence from the artifacts or products they create as a result of the 
captured activities (see, e.g., Rupp et al., 2010). 

As alluded to previously, our system assesses inquiry practices using a 
combination of knowledge-engineered rules (Feigenbaum & McCorduck, 1983) and data 
mined algorithms (Romero & Ventura, 2007; Baker & Yacef, 2009), depending on 
whether the inquiry practice of interest is more well-defined or more ill-defined. 
Knowledge-engineering techniques (see Shute & Glaser, 1990; Schunn & Anderson, 
1998), which work best for well-defined domains such as mathematics problem solving 
(e.g., Koedinger & Corbett, 2006) and computer programming tasks (Corbett &
Anderson, 1995), are used for inquiry practices or sub-skills that are similarly
well-defined such as ‘identifying independent versus dependent variables’ in developing
hypotheses or certain aspects of ‘data interpretation/analysis’.

By contrast, educational data mining techniques are used to assess inquiry 
practices for which skilled performance can manifest itself in several ways and more 
complex disambiguation of evidence is necessary; for a thorough review of the methods 
we used see Gobert et al., 2012, 2013; Sao Pedro et al., 2011. Two examples of inquiry 
practices that fall into this category are ‘testing stated hypotheses’ and ‘planning and 
carrying out experiments’ since there are a number of approaches that students can take 
on these, reflecting both skilled and unskilled performance (Shute, Glaser, & Raghavan, 
1989; Kuhn, 2005). 

We validated our assessment algorithms for these inquiry practices with 
thousands of middle school students. Specifically, our data-mined algorithm for 
evaluating whether students are testing their articulated hypothesis matches a human 
scorer 91% of the time. Similarly, our data-mined algorithm for assessing students’ skills 
at designing controlled experiments can distinguish controlled vs. confounded data 
collection 94% of the time (Sao Pedro et al., 2012). Furthermore, the latter algorithm can 
assess whether students are conducting controlled experiments even when students do not 
conduct their trials sequentially. Viewed this way, our work represents a large 
methodological advance over other assessments of this skill that evaluate only 
information from sequential trials as evidence of this skill (McElhaney & Linn, 2010; 
Klahr & Nigam, 2004) or evaluate only information from any two contrasting trials 
regardless of whether they are sequential or not. The former approach is too stringent an
assessment of this skill since students do not necessarily conduct trials sequentially. The
latter approach is too lenient an assessment of this skill since one cannot know whether 
the two non-sequential trials were conducted by chance or were collected deliberately to 
be contrasted by the student. 
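
The two simpler criteria contrasted above can be written down directly, which makes the difference between them easy to see. The sketch below implements only those two rule-based baselines (sequential pairs only, and any two trials); it is not the data-mined detector described in this chapter, which is not reproduced here. The trial format and function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List

Trial = Dict[str, str]  # maps each independent variable to its chosen value


def is_controlled_pair(a: Trial, b: Trial, target: str) -> bool:
    """True if the two trials differ on the target variable and on nothing else."""
    differing = {v for v in a if a[v] != b[v]}
    return differing == {target}


def sequential_only(trials: List[Trial], target: str) -> bool:
    """Stringent criterion: some pair of back-to-back trials is controlled."""
    return any(is_controlled_pair(a, b, target) for a, b in zip(trials, trials[1:]))


def any_pair(trials: List[Trial], target: str) -> bool:
    """Lenient criterion: any two trials, adjacent or not, are controlled."""
    return any(is_controlled_pair(a, b, target) for a, b in combinations(trials, 2))


trials = [
    {"steepness": "low", "surface": "smooth", "ball": "rubber"},
    {"steepness": "high", "surface": "rough", "ball": "rubber"},   # confounded vs. trial 1
    {"steepness": "high", "surface": "smooth", "ball": "rubber"},  # controlled vs. trial 1
]
print(sequential_only(trials, "steepness"))  # False
print(any_pair(trials, "steepness"))         # True
```

For this student, the sequential criterion misses the controlled comparison between the first and third trials, while the any-pair criterion credits it without any evidence that the contrast was deliberate. This is the gap that motivates assessing the skill from richer features of the whole sequence of trials.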

As of this writing, we have developed and validated assessment algorithms for all 
the NGSS science practices. We have also shown their generalizability across multiple 
domains (Sao Pedro, Jiang, Paquette, Baker, & Gobert, 2014; Sao Pedro, Gobert, Toto, & 
Paquette, 2015; Gobert, Kim, Sao Pedro, Kennedy, & Betts, in press). 


Summary

In an educational system like the one in the United States, which is dedicated to
standardization and accountability, inquiry in science classrooms cannot take center stage 
until considerable progress has been made on inquiry assessment. This is where “the 
rubber meets the road”, and success with reform outlined in policy documents such as the 
NGSS rests, in our opinion, on rigorous assessment of students’ science practices. In this 
chapter we described Inq-ITS, an on-line environment for assessing and scaffolding
students’ inquiry practices. We described how the NGSS, the learning sciences literature, 
the literature on students’ difficulties with inquiry, and our early pilot work with students 
with hands-on inquiry tasks informed the design of Inq-ITS and how our system 
measured up in terms of granularity, measurability, and instructional relevance of the 
assessed scientific practices (Leighton & Gierl, 2011). 

We argue that our system reflects key advances in several areas of assessment design
and practice. Perhaps most notably, the use of sophisticated knowledge-engineered
and data-mined models allowed us to (1) assess students’ inquiry practices in
real time, (2) generate teacher reports and alerts in real time, and (3) trigger a pedagogical 
agent to scaffold inquiry in real time, all of which are critical to deep learning (Black & 
Wiliam, 1998; Pellegrino et al., 2001). This has several related benefits. First, teachers do 
not need to use additional instructional time for assessment because they receive 
immediate reports and alerts about their students and know who and what inquiry 
practices to focus on during instruction. Second, since our assessment algorithms work in
real time over the web, the data-mined models are able to continually capture students’
emerging learning trajectories, which is ideal for continual adaptive assessment, adaptive
instruction, and effective learning (Klahr & Nigam, 2004; Vygotsky, 1978). 

Third, from a research perspective, the continual data that the Inq-ITS system 
provides allows us to advance our understanding of how students both conduct inquiry 
and hone these practices of inquiry over time. Finally, due to the sophistication of the 
microworld and widget design as well as the design of the underlying computational 
architecture, the Inq-ITS system can be scaled to many users simultaneously. Due to 
these functionalities, Inq-ITS and systems like it will continue to reduce or eliminate the 
separation between learning activities and assessment activities, allowing us to realize
both the long-range vision for learner-centered environments (Quellmalz & Pellegrino,
2009; Quellmalz et al., 2012) as well as the rigorous assessment of inquiry practices as
called for in the NGSS.
Table 1

Next Generation Science Standards (NGSS Lead States, 2013)

Life Science
LS1: From Molecules to Organisms: Structures and Processes
LS2: Ecosystems: Interactions, Energy, and Dynamics
LS3: Heredity: Inheritance and Variation of Traits
LS4: Biological Evolution: Unity and Diversity

Physical Science
PS1: Matter and Its Interactions
PS2: Motion and Stability: Forces and Interactions
PS3: Energy
PS4: Waves and Their Applications in Technologies for Information Transfer

Earth & Space Science
ESS1: Earth’s Place in the Universe
ESS2: Earth’s Systems
ESS3: Earth and Human Activity

Engineering & Technology
ETS1: Engineering Design
ETS2: Links Among Engineering, Technology, Science, and Society

Figure 1 Model-based learning and teaching framework.

[Figure 2 content: class performance report for “Mr. Green, Section 3” showing the number
of students by skill. Skills listed: Hypothesis Formation (Identify Independent Variable,
Identify Dependent Variable, Relationship between variables); Design & Conduct Experiments
(Control for Variables Strategy, Targeting Independent variables); Correct or incorrect
claims; Selecting trials to warrant claims; Communicating Findings. Note shown: “May not
understand the concept of variables”. Student list columns: Name, Skill (%).]

Figure 2 Assessment report for teachers reported out by inquiry practice and sub-skills.

Figure 3 Screenshot of the Inq-ITS system with all phases of inquiry shown. Students
generate a question, run trials to test it, then interpret data, and select trials to warrant 
claims with evidence. 